Biomedical Data Science School

Day 2 Part 1: ChIP-Seq

Edinburgh, MRC Institute of Genetics and Molecular Medicine Tuesday, 12th March

General Topics

  • Introduction to epigenomics
  • ChIP-Seq Differential Modification Calling
    • Pre-processing
    • Peak Calling
    • Differential Peak Calling (Occupancy, Affinity, Shape)
  • Analysing DNA methylation
    • Reduced Representation Bisulfit Sequencing (RRBS)
    • Differential Methylation Calling on CpGs
    • Differentially Methylated Regions
    • Annotation

Schedule

  • 09:30 - 11:00 (Lecture) Epigenomics 1: Analysing Chromatin

  • 12:30 - 14:00 Lunch break

  • 14:00 - 16:00 (Hands-on) ChIP-Seq Analysis Pipeline

  • 16:00 - 17:00 (Hands-on) RRBS Analysis Pipeline

Lead Instructor

Gabriele Schweikert | email

Co-Instructor(s)

David Helekal | email

Learning Objectives

  • Get a basic understanding of epigenomic mechanisms and their significance for physiological and pathological cell functioning.
  • Overview of experimental techniques to measure epigenomic footprints with focus on ChIP-Seq and BS-Seq
  • Explore and apply a basic ChIP-Seq Data Analysis Pipeline including Differential Modification Calling (using DiffBind and MMDiff)
  • Explore and apply a basic BS-Seq Data Analysis Pipeline including Calling of differentially methylated CpGs and differentially methylated regions (using methylKit and bsseq)
  • Learn how to annotate results.

Bioconductor Packages

Bioconductor has many packages which support the analysis of high-throughput sequence data; currently there are more than 70 packages that can be used to analyse ChIP-seq data. A list, and short description of available packages can be found here: BiocView ChIP-Seq

Bioconductor has scheduled releases every 6 months, with these releases new versions of the packages will become available. the Bioconductor project ensures that all the packages within a release will work together in harmony (hence the “conductor” metaphor).

If you haven’t installed the packages that we need to run this tutorial you will need to do so now.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("knitr")
BiocManager::install("kableExtra")
BiocManager::install("TxDb.Hsapiens.UCSC.hg38.knownGene")
BiocManager::install('TxDb.Mmusculus.UCSC.mm10.knownGene')
BiocManager::install("BSgenome.Hsapiens.UCSC.hg38")
BiocManager::install("DiffBind")
BiocManager::install('MotifDb')
BiocManager::install('methylKit')
BiocManager::install('genomation')
BiocManager::install('ggplot2')
BiocManager::install("AnnotationHub")
BiocManager::install("annotatr")
BiocManager::install("bsseq")
BiocManager::install("DSS")

Additionally we will have to install the provided MMDiff package:

install.packages(repos=NULL,'MMDiff3_1.6.1.tgz')
library(MMDiff3)

Check that all packages can be loaded:

Documentation packages:

library("knitr")
library("kableExtra")

Analysis packages:

library("DiffBind")
## Loading required package: SummarizedExperiment
## Loading required package: DelayedArray
## Loading required package: matrixStats
## 
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians
## Loading required package: BiocParallel
## 
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following object is masked from 'package:Biostrings':
## 
##     type
## The following objects are masked from 'package:base':
## 
##     aperm, apply
## 
library("MMDiff3")

Annotation packages:

library("TxDb.Hsapiens.UCSC.hg38.knownGene")
## Loading required package: GenomicFeatures
## Loading required package: AnnotationDbi
library("BSgenome.Hsapiens.UCSC.hg38")
## Loading required package: BSgenome
## Loading required package: rtracklayer
library('MotifDb')
## See system.file("LICENSE", package="MotifDb") for use restrictions.

ChIP-Seq Data Analysis

During this course we shall look at the analysis of a typical (very simple) ChIP-Seq experiment. We will start from the fastq files, and briefly look at alignment and pre-processing steps. These will not be run during the tutorial due to time constraints. Instead you will be provided with pre-processed bam files. However, by following the instructions you will be able to create the provided files on your own.

We will focus on the Histone modification H3K4me3 and will try to detect differences in the genomic distribution of this epigenomic mark between two different cell lines: human embryonic stem cells H1 and fetal lung cells, myofibroblasts, IMR90 cells. We will follow two different strategies: 1) It is well known that the modification H3K4me3 is predominantly found around gene promoters, and we will thus use annotated genes to define regions of interests (ROIs) around their promoters. We will try to find H3K4me3 differences in these regions. 2) We will also follow a more data driven approach where we use a peak caller to identify regions of significant enrichment relative to the background and we will also do a differential modification analysis in those regions.

Data

During the tutorial we will use data from the Roadmap Epigenomics Project.

After locating the data Epigenomics_metadata.xlsx we have stored the relevant SRR accession numbers in a SRR_table file which we have used to downloaded the data using fastq_dump:

# less SRR_table | parallel "fastq-dump --outdir fastq  --gzip --skip-technical  --readids --read-filter pass --dumpbase --split-3 --clip {}"

We have already prepared the data sets for you. Please download the data here: http://bifx-core.bio.ed.ac.uk/Gabriele/public/Data.tar.gz.

Preprocessing Data

We have preprocessed the data sets for you and will not do these steps in class. We will talk you through the different steps such that you can do it on your own data. You will use a number of command-line tools, most of which you have encountered in the last couple of days:

  • fastqc, a quality control tool for high throughput sequence data.

  • mulitqc to aggregate results from bioinformatics analyses across many samples into a single report.

  • trim_galore, a wrapper tool around Cutadapt to apply quality and adapter trimming to FastQ files.

  • Bowtie2, is a fast and memory-efficient tool for aligning sequencing reads to long reference sequences.

  • Samtools, a set of utilities that manipulate alignments in the BAM format. It imports from and exports to the SAM (Sequence Alignment/Map) format, does sorting, merging and indexing, and allows to retrieve reads in any regions swiftly.

Additionally, we use the GNU Wget package for retrieving files using HTTP, HTTPS, FTP and FTPS and the GNU parallel tool for executing jobs in parallel using one or more computers, such that different fastq files can be processed all at the same time rather than sequentially.

Quality Control

Before we continue, we would like to assess the quality of each data file that we have downloaded so far. To do that we use the fastqc tool, e.g.:

cd fastq
fastqc IMR90-H3K4me3-1-1.fastq.gz 

This generates a file IMR90-H3K4me3-1-1_fastqc.html and a corresponding IMR90-H3K4me3-1-1_fastqc.zip. By opening the html file we can inspect the result here. As you can see, there are several issues with this file and we will have to use trim_galore to filter the reads.

As we want to run fastqc for each fastq file in the directory, we would like to speed up the process by using the GNU parallel function. Here we use the ls function to list all files ending on ‘.fastq.gz’, we then pass all these file names on to the fastqc tool using the | (pipe) functionality.


ls *fastq.gz | parallel "fastqc {}"

Trimmming

As we have observed in the fastqc report, many reads have low quality towards the end of the read. Additionally, we have adapter contamination. To remove these issues we run trim_galore, which will also create another fastqc report:

ls *fastq.gz | parallel -j 30 trim_galore  --stringency 3 --fastqc -o trim_galore/ {}

Compare the overall quality of the remaining reads after trimming with the original report: fastqc.

Eventually, we use multqc to aggregate the results.

ls *.fastq.gz | parallel fastqc
multiqc

The multiqc reports can be viewed here.

Alignment

bowtie2-build Annotation/Homo_sapiens.a.dna.primary_assembly.fa GRCh38 
ls *_trimmed.fq.gz | parallel -j 30 bowtie2 -x GRCh38 -U {} -S {}.sam
ls *.sam | parallel "samtools view -bS {} | samtools sort - -o {}.bam"
ls *.bam | parallel "samtoots index {}"

Merge files and subset for tutorial

To merge files from the different lanes:

samtools merge IMR90-H3K4me3-1.bam IMR90-H3K4me3-1-1_trimmed.fq.gz.sam.bam IMR90-H3K4me3-1-2_trimmed.fq.gz.sam.bam

Next we need to index these bam files

samtools index IMR90-H3K4me3-1.bam
samtools view -h IMR90-H3K4me3-2.bam 19 > chr19-IMR90-H3K4me3-2.sam

samtools view -bS chr19-IMR90-H3K4me3-2.sam > chr19-IMR90-H3K4me3-2.bam

These bam and index files are available for you in the Bowtie2 sub-directory of your downloaded EpiCourse2018_Data Folder.

Creating Sample Sheet

We will now create a sample sheet that captures the most important information about the data. To do that open a new text file in RStudio, copy the information below into the file, remove empty spaces and replace them with commas and save it as a comma separated file (.csv), sampleSheet.csv file. Note that for later analysis, it is important to include the precise column names and remove newline and empty spaces.

SampleID Tissue Factor Condition Replicate bamReads ControllID bamControl
IMR90-H3K4me3-1 IMR90 H3K4me3 Ctr 1 Data/ChIP-Seq/Bowtie2/chr19-IMR90-H3K4me3-1.bam IMR90-Input1 Data/ChIP-Seq/Bowtie2/chr19-IMR90-Input.bam
IMR90-H3K4me3-2 IMR90 H3K4me3 Ctr 2 Data/ChIP-Seq/Bowtie2/chr19-IMR90-H3K4me3-2.bam IMR90-Input2 Data/ChIP-Seq/Bowtie2/chr19-IMR90-Input.bam
H1-H3K4me3-1 H1 H3K4me3 Ctr 1 Data/ChIP-Seq/Bowtie2/chr19-H1-H3K4me3-1.bam H1-Input1 Data/ChIP-Seq/Bowtie2/chr19-H1-Input-2.bam
H1-H3K4me3-3 H1 H3K4me3 Ctr 3 Data/ChIP-Seq/Bowtie2/chr19-H1-H3K4me3-3.bam H1-Input2 Data/ChIP-Seq/Bowtie2/chr19-H1-Input-2.bam

Defining Regions of Interests

To analyse the ChIP-Seq data sets we first have to think about, the regions which we want to examine in detail. One obvious choice is to look at all those regions where the ChIP signal is significantly increased relative to the background signal. We will use a Peak caller (MACS2) shortly to detect these regions. However, the performance of peak callers depends on the right set of parameters, the signal-to-noise-ratio of your data and a lot of other things. So, sometimes it is also interesting to use prior information about the data to inform your choice of regions. For example, it is well known that H3K4me3 is found predominantly around promoters of actively transcribed genes. As genes a reasonably well annotated in human, it is worth to also take a ‘supervised’ approach, where we look at defined windows around transcription start sites (TSS). This is what we are doing in the following. After that we will also look at enriched regions called by MACS2.

Use Annotation to Define Promoter Regions

Bioconductor provides an easy R interface to a number of prefabricated databases that contain annotations. You will learn more about available databases here: AnnotationDbi One such package is TxDb.Hsapiens.UCSC.hg38.knownGene which accesses UCSC build hg19 based on the knownGene Track. Here we are first interested in annotated genes in this database:

txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
G = genes(txdb, columns="gene_id", filter=NULL, single.strand.genes.only=TRUE)

Have a look at the object:

G
## GRanges object with 24183 ranges and 1 metadata column:
##             seqnames              ranges strand |     gene_id
##                <Rle>           <IRanges>  <Rle> | <character>
##           1    chr19   58345178-58362751      - |           1
##          10     chr8   18391245-18401218      + |          10
##         100    chr20   44619522-44651742      - |         100
##        1000    chr18   27950966-28177446      - |        1000
##   100009613    chr11   70072434-70075433      - |   100009613
##         ...      ...                 ...    ... .         ...
##        9991     chr9 112217716-112333667      - |        9991
##        9992    chr21   34364024-34371389      + |        9992
##        9993    chr22   19036282-19122454      - |        9993
##        9994     chr6   89829894-89874436      + |        9994
##        9997    chr22   50523568-50526439      - |        9997
##   -------
##   seqinfo: 455 sequences (1 circular) from hg38 genome
summary(width(G))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##       41     5801    21374    69152    61456 36632075
hist(width(G)[width(G)<100000]/1000,breaks = 1000, main = 'Histogram of Gene Lengths',xlab='gene length in kbp')

For simplicity (and to practice R) we would like to look at genes which are longer than 2000bp but smaller than 1Mbp, are on chromosome 19 and are also having a gap to their neighboring genes of at least 2000bp.

longGenes = G[width(G)>2000 & width(G)<100000 & seqnames(G)=='chr19']
summary(width(longGenes))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2009    8592   15902   22173   30162   98903
hist(width(longGenes)/1000,breaks = 1000,main = 'Histogram of Filtered Gene Lengths',xlab='gene length in kbp')

We will next filter out overlapping genes or genes which are close to a neighboring gene.

ov = findOverlaps( G,longGenes,maxgap=2000)
ii = which(duplicated(subjectHits(ov)))
OverlappingGenes = longGenes[subjectHits(ov)[ii]]
nonOverlappinglongGenes = longGenes[-subjectHits(ov)[ii]]

Test if we didn’t make a mistake:

ov = findOverlaps(nonOverlappinglongGenes ,G)

For the filtered genes we next look at promoter regions:

Promoters = promoters(nonOverlappinglongGenes ,upstream=2000, downstream=200)
Promoters
## GRanges object with 838 ranges and 1 metadata column:
##             seqnames            ranges strand |     gene_id
##                <Rle>         <IRanges>  <Rle> | <character>
##   100073347    chr19 56838902-56841101      + |   100073347
##   100128252    chr19 56475603-56477802      + |   100128252
##   100128398    chr19 58000061-58002260      + |   100128398
##   100128568    chr19   5976403-5978602      + |   100128568
##   100128675    chr19 35106105-35108304      - |   100128675
##         ...      ...               ...    ... .         ...
##        9667    chr19   5623847-5626046      - |        9667
##        9668    chr19 52048621-52050820      - |        9668
##         970    chr19   6603904-6606103      - |         970
##         976    chr19 14378501-14380700      + |         976
##        9817    chr19 10503542-10505741      - |        9817
##   -------
##   seqinfo: 455 sequences (1 circular) from hg38 genome

For our toy example we will only use a random subset of 500 of these regions. We will use the random number generator but set a seed such that our code is reproducible.

set.seed(1237628)
idx = sort(sample(x=length(Promoters),size=500,replace=FALSE))
candPromoters = Promoters[idx] 

Now save your object in case you want to use it again later. You can also write it out as a bed file.

save(file='Data/ChIP-Seq/RoI/candPromoters.rData',candPromoters,nonOverlappinglongGenes)

df <- data.frame(seqnames=seqnames(candPromoters),
  starts=start(candPromoters)-1,
  ends=end(candPromoters),
  names=paste('Promoter-',seq(1,length(candPromoters)),sep=''),
  scores=c(rep(1, length(candPromoters))),
  strands=strand(candPromoters))

write.table(df, file="Data/ChIP-Seq/RoI/candPromoters.bed", quote=F, sep="\t", row.names=F, col.names=F)

Detecting Enriched Regions

Next, we will examine regions that are detected to be significantly enriched by a peak caller. We are using MACS2 and we are trying to find enriched regions in each of the ChIP samples relative to the cell specific Input. We have already prepared this step for you, so you do not need to run the following steps. (Note, that if you run it on your own, you need to use the shell / terminal not R console.)

macs2 callpeak -t Bowtie2/chr19-IMR90-H3K4me3-1.bam  -c  Bowtie2/chr19-IMR90-Input.bam  -g hs -q 0.01  --call-summits -n IMR90-H3K4me3-1 --outdir MACS2
 
macs2 callpeak -t Bowtie2/chr19-IMR90-H3K4me3-2.bam  -c  Bowtie2/chr19-IMR90-Input.bam  -g hs -q 0.01  --call-summits -n IMR90-H3K4me3-2 --outdir MACS2
 


macs2 callpeak -t Bowtie2/chr19-H1-H3K4me3-1.bam  -c  Bowtie2/chr19-H1-Input-2.bam  -g hs -q 0.01  --call-summits -n H1-H3K4me3-1 --outdir MACS2
 
macs2 callpeak -t Bowtie2/chr19-H1-H3K4me3-3.bam  -c  Bowtie2/chr19-H1-Input-2.bam  -g hs -q 0.01  --call-summits -n H1-H3K4me3-3 --outdir MACS2

These Files are available for you in the MACS2 sub-directory of your downloaded EpiCourse2018_Data Folder.

Differential Region (Occupancy) Analysis (DiffBind)

Now, we are finally ready for the real thing: Finding differences between the two cell lines, H1 and IMR90. For this task we will use the DiffBind Bioconductor package.

First, we will append our SampleSheet with columns specifying the MACS2 called enriched regions for each sample:

SampleID Tissue Factor Condition Replicate bamReads ControllID bamControl Peaks PeakCaller
1 IMR90-H3K4me3-1 IMR90 H3K4me3 Ctr 1 Data/ChIP-Seq/Bowtie2/chr19-IMR90-H3K4me3-1.bam IMR90-Input1 Data/ChIP-Seq/Bowtie2/chr19-IMR90-Input.bam Data/ChIP-Seq/MACS2/IMR90-H3K4me3-1_peaks.xls macs
2 IMR90-H3K4me3-2 IMR90 H3K4me3 Ctr 2 Data/ChIP-Seq/Bowtie2/chr19-IMR90-H3K4me3-2.bam IMR90-Input2 Data/ChIP-Seq/Bowtie2/chr19-IMR90-Input.bam Data/ChIP-Seq/MACS2/IMR90-H3K4me3-2_peaks.xls macs
3 H1-H3K4me3-1 H1 H3K4me3 Ctr 1 Data/ChIP-Seq/Bowtie2/chr19-H1-H3K4me3-1.bam H1-Input1 Data/ChIP-Seq/Bowtie2/chr19-H1-Input-2.bam Data/ChIP-Seq/MACS2/H1-H3K4me3-1_peaks.xls macs
4 H1-H3K4me3-3 H1 H3K4me3 Ctr 3 Data/ChIP-Seq/Bowtie2/chr19-H1-H3K4me3-3.bam H1-Input2 Data/ChIP-Seq/Bowtie2/chr19-H1-Input-2.bam Data/ChIP-Seq/MACS2/H1-H3K4me3-1_peaks.xls macs

(Note this file is also provided for you in here This file will be used to create a DBA object: Initially only the meta data is used and the peak regions are loaded.

  DBA <- dba(sampleSheet="Data/ChIP-Seq/SampleSheet.csv")

Examine the DBA object: It shows you the number of enriched regions discovered in each sample (the Intervals column). It also shows you the number of consensus peaks (1509) at the top of the output.

  DBA
## 4 Samples, 1509 sites in matrix (1580 total):
##                ID Tissue  Factor Condition Replicate Caller Intervals
## 1 IMR90-H3K4me3-1  IMR90 H3K4me3       Ctr         1   macs      1776
## 2 IMR90-H3K4me3-2  IMR90 H3K4me3       Ctr         2   macs      1797
## 3    H1-H3K4me3-1     H1 H3K4me3       Ctr         1   macs      2142
## 4    H1-H3K4me3-3     H1 H3K4me3       Ctr         3   macs      2142

To can get an idea how well the enriched regions correlate we can for example plot the DBA object. It is reassuring to see that similar regions are called in the replicate samples:

  plot(DBA)

You can next go ahead and further analysis the differences in enriched regions detected for the two cell lines. Further details can be found in the DiffBind vignette.

We are not going to do this here, instead we are going to create a consensus peak set, which we will use for our differential modification analysis:

  consensus_peaks <- dba.peakset(DBA, bRetrieve=TRUE)
  save(file='Data/ChIP-Seq/RoI/MACS2consensus_peaks.rData',consensus_peaks)

df <- data.frame(seqnames=seqnames(consensus_peaks),
  starts=start(consensus_peaks)-1,
  ends=end(consensus_peaks),
  names=paste('MACS2consensus-',seq(1,length(consensus_peaks)),sep=''),
  scores=c(rep(1, length(consensus_peaks))),
  strands=strand(consensus_peaks))

write.table(df, file="Data/ChIP-Seq/RoI/MACS2consensus_peaks.bed", quote=F, sep="\t", row.names=F, col.names=F)

Differential Modification (Affinity) Analysis (DiffBind)

We are not only interested in where we find the modification on the genome, but also how much we find there. We therefore continue with a Differential Modification Analysis on the consensus peak set. To do that we need to count how many reads overlap with each peak in each of the samples:

  DBA <- dba.count(DBA)

Once we have counted the reads, we can again observe how well samples correlate, or alternatively perform principal component analysis.

  plot(DBA)

  dba.plotPCA(DBA,DBA_TISSUE,label=DBA_CONDITION)

The next step, will be a statistical test to determine regions which are significantly different between H1 and IMR90 cells. In this case, we have to set a contrast between the different tissues. (In other cases, we might want to find differences between conditions, e.g. control vs treatment, we would then set categories=DBA_CONDITION)

  DBA <- dba.contrast(DBA,categories=DBA_TISSUE,minMembers=2)

DiffBind allows access to several methods for statistical testing of count data, most notable EdgeR and DESeq2. The Default method is (DESeq2)[https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8], which was initially developed for RNA-Seq data sets. Note, that there are also a number of normalization options. Here we will use the default normalization. It is important to think about this and to explore the DiffBind vignette further. The results can change massively when using a different normalization method.

  DBA <- dba.analyze(DBA)
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates

The new DBA object has a contrast field added. It shows that you are comparing a group IMR90 with 2 members to a group H1 with also two members. Using DESeq it has found 727 peaks to be differentially modified between the two groups.

  DBA
## 4 Samples, 1509 sites in matrix:
##                ID Tissue  Factor Condition Replicate Caller Intervals FRiP
## 1 IMR90-H3K4me3-1  IMR90 H3K4me3       Ctr         1 counts      1509 0.92
## 2 IMR90-H3K4me3-2  IMR90 H3K4me3       Ctr         2 counts      1509 0.89
## 3    H1-H3K4me3-1     H1 H3K4me3       Ctr         1 counts      1509 0.79
## 4    H1-H3K4me3-3     H1 H3K4me3       Ctr         3 counts      1509 0.77
## 
## 1 Contrast:
##   Group1 Members1 Group2 Members2 DB.DESeq2
## 1  IMR90        2     H1        2       727

We next examine the results:

  dba.plotMA(DBA)

  dba.plotVolcano(DBA)

  DBA.DB <- dba.report(DBA)
  DBA.DB 
## GRanges object with 727 ranges and 6 metadata columns:
##        seqnames            ranges strand |      Conc Conc_IMR90   Conc_H1
##           <Rle>         <IRanges>  <Rle> | <numeric>  <numeric> <numeric>
##    447       19 13838771-13843376      * |      8.21       9.13      5.08
##    886       19 39514797-39516610      * |      6.99       1.05      7.98
##    833       19 38264388-38265875      * |      7.01       2.57      7.98
##   1147       19 47713298-47714904      * |      7.88       8.74      5.41
##    630       19 19860238-19861389      * |      6.81       0.16      7.81
##    ...      ...               ...    ... .       ...        ...       ...
##   1442       19 57239914-57241846      * |      8.44       8.67      8.17
##    101       19   2598914-2599608      * |      5.02       5.61      4.02
##   1394       19 55115999-55117640      * |      7.62       7.91      7.27
##   1003       19 43934252-43935818      * |      7.98       8.24      7.66
##    920       19 40714576-40719638      * |      8.95       9.16      8.71
##             Fold   p-value       FDR
##        <numeric> <numeric> <numeric>
##    447      4.05  6.14e-51  9.27e-48
##    886     -6.93  9.76e-43  7.36e-40
##    833      -5.4  3.13e-42  1.57e-39
##   1147      3.34  6.63e-39   2.5e-36
##    630     -7.65  2.47e-35  7.46e-33
##    ...       ...       ...       ...
##   1442       0.5    0.0226    0.0472
##    101      1.58    0.0229    0.0477
##   1394      0.64    0.0229    0.0477
##   1003      0.58    0.0231     0.048
##    920      0.45     0.024    0.0499
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Differential Modification (Shape) Analysis (MMDiff2)

Peaks = dba.peakset(DBA,bRetrieve=TRUE)


ExperimentData <- list(genome='BSgenome.Hsapiens.UCSC.hg38',
                       dataDir=".",
                       sampleSheet ="Data/ChIP-Seq/SampleSheet.csv")

MetaData <- list('ExpData' = ExperimentData)
MMD <- DBAmmd(MetaData)

MMD <- setRegions(MMD, Peaks)
MMD <- getPeakReads(MMD)
##  [1] 36 34 26 32 28 35 29 33 23 27 30 25 24 20 31 22 21
##  [1] 36 32 28 35 33 34 31 27 30 25 29 20 26 24 21 23 22
##  [1] 36 35 22 33 34 28 32 29 23 27 30 26 21 24 25
##  [1] 36 33 32 31 35 34 28 21 22 27 20 25 26 23 24 30 29
## [1] 33 36 35 27 28 23
## [1] 35 27 36 34 32 33
MMD <- estimateFragmentCenters(MMD)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -20.0   130.0   140.0   135.8   150.0   180.0 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     110     120     130     130     140     160 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -30.0   120.0   140.0   134.6   160.0   280.0 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    90.0   120.0   130.0   129.1   140.0   170.0 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -380.00 -180.00  -20.00  -28.89   82.50  340.00 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -320.00 -155.00  -20.00  -39.09   75.00  270.00
MMD <- compHists(MMD)
MMD <- compDists(MMD, dist.method = "MMD")
MMD <- setContrast(MMD,contrast='byTissue')
MMD <- compPvals(MMD)
MMD.DB <- reportResults(MMD)
plotDists(MMD, dist.method='MMD',whichContrast=1,
                          diff.method='MMD.locfit',
                          bUsePval=FALSE, th=0.1,
                          title=NULL, what=3,
                          xlim=NULL,ylim=NULL,Peak.IDs=NULL,
                          withLegend=TRUE)

Compare Results between MMDiff and DiffBind

length(intersect(names(MMD.DB),names(DBA.DB)))
## [1] 466
sum(!is.element(names(DBA.DB),names(MMD.DB)))
## [1] 261
sum(!is.element(names(MMD.DB),names(DBA.DB)))
## [1] 175
MMD.DBA = MMD.DB[is.element(names(MMD.DB),names(DBA.DB))]
MMD.exclusive = MMD.DB[!is.element(names(MMD.DB),names(DBA.DB))]
DBA.exclusive = DBA.DB[!is.element(names(DBA.DB),names(MMD.DB))]

Example peaks detected by both methods:

Example peaks detected by MMD only:

Example peaks detected by DBA only:

Comparison in MMD space:

DBA.exclusive

MMD.exclusive

MMD and DBA

Comparison in FC space:

MMDiff exclusive:

DBA exclusive:

DBA and MMDiff:

Session Info

sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] MotifDb_1.24.1                         
##  [2] BSgenome.Hsapiens.UCSC.hg38_1.4.1      
##  [3] BSgenome_1.50.0                        
##  [4] rtracklayer_1.42.2                     
##  [5] TxDb.Hsapiens.UCSC.hg38.knownGene_3.4.0
##  [6] GenomicFeatures_1.34.4                 
##  [7] AnnotationDbi_1.44.0                   
##  [8] DiffBind_2.10.0                        
##  [9] SummarizedExperiment_1.12.0            
## [10] DelayedArray_0.8.0                     
## [11] BiocParallel_1.16.6                    
## [12] matrixStats_0.54.0                     
## [13] kableExtra_1.0.1                       
## [14] knitr_1.22                             
## [15] MMDiff3_1.6.1                          
## [16] Biobase_2.42.0                         
## [17] Rsamtools_1.34.1                       
## [18] Biostrings_2.50.2                      
## [19] XVector_0.22.0                         
## [20] GenomicRanges_1.34.0                   
## [21] GenomeInfoDb_1.18.2                    
## [22] IRanges_2.16.0                         
## [23] S4Vectors_0.20.1                       
## [24] BiocGenerics_0.28.0                    
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.1.3          GOstats_2.48.0          
##   [3] Hmisc_4.2-0              plyr_1.8.4              
##   [5] lazyeval_0.2.1           GSEABase_1.44.0         
##   [7] splines_3.5.2            BatchJobs_1.7           
##   [9] ggplot2_3.1.0            amap_0.8-16             
##  [11] digest_0.6.18            htmltools_0.3.6         
##  [13] GO.db_3.7.0              gdata_2.18.0            
##  [15] magrittr_1.5             checkmate_1.9.1         
##  [17] memoise_1.1.0            BBmisc_1.11             
##  [19] cluster_2.0.7-1          limma_3.38.3            
##  [21] readr_1.3.1              annotate_1.60.1         
##  [23] systemPipeR_1.16.1       prettyunits_1.0.2       
##  [25] colorspace_1.4-0         blob_1.1.1              
##  [27] rvest_0.3.2              ggrepel_0.8.0           
##  [29] xfun_0.5                 dplyr_0.8.0.1           
##  [31] crayon_1.3.4             RCurl_1.95-4.12         
##  [33] graph_1.60.0             genefilter_1.64.0       
##  [35] brew_1.0-6               survival_2.43-3         
##  [37] sendmailR_1.2-1          glue_1.3.0              
##  [39] gtable_0.2.0             zlibbioc_1.28.0         
##  [41] webshot_0.5.1            Rgraphviz_2.26.0        
##  [43] scales_1.0.0             pheatmap_1.0.12         
##  [45] DBI_1.0.0                edgeR_3.24.3            
##  [47] Rcpp_1.0.0               viridisLite_0.3.0       
##  [49] xtable_1.8-3             progress_1.2.0          
##  [51] htmlTable_1.13.1         foreign_0.8-71          
##  [53] bit_1.1-14               Formula_1.2-3           
##  [55] AnnotationForge_1.24.0   htmlwidgets_1.3         
##  [57] httr_1.4.0               gplots_3.0.1.1          
##  [59] RColorBrewer_1.1-2       acepack_1.4.1           
##  [61] pkgconfig_2.0.2          XML_3.98-1.19           
##  [63] nnet_7.3-12              locfit_1.5-9.1          
##  [65] labeling_0.3             tidyselect_0.2.5        
##  [67] rlang_0.3.1              later_0.8.0             
##  [69] munsell_0.5.0            tools_3.5.2             
##  [71] RSQLite_2.1.1            evaluate_0.13           
##  [73] stringr_1.4.0            bit64_0.9-7             
##  [75] caTools_1.17.1.2         purrr_0.3.1             
##  [77] splitstackshape_1.4.6    RBGL_1.58.1             
##  [79] mime_0.6                 xml2_1.2.0              
##  [81] biomaRt_2.38.0           compiler_3.5.2          
##  [83] rstudioapi_0.9.0         tibble_2.0.1            
##  [85] geneplotter_1.60.0       stringi_1.3.1           
##  [87] highr_0.7                lattice_0.20-38         
##  [89] Matrix_1.2-15            pillar_1.3.1            
##  [91] data.table_1.12.0        bitops_1.0-6            
##  [93] httpuv_1.4.5.1           R6_2.4.0                
##  [95] latticeExtra_0.6-28      hwriter_1.3.2           
##  [97] promises_1.0.1           ShortRead_1.40.0        
##  [99] KernSmooth_2.23-15       gridExtra_2.3           
## [101] gtools_3.8.1             assertthat_0.2.0        
## [103] DESeq2_1.22.2            Category_2.48.1         
## [105] rjson_0.2.20             GenomicAlignments_1.18.1
## [107] GenomeInfoDbData_1.2.0   hms_0.4.2               
## [109] grid_3.5.2               rpart_4.1-13            
## [111] rmarkdown_1.11           shiny_1.2.0             
## [113] base64enc_0.1-3